Main Features
- R6 Class; fast, flexible and modularized
- Taxa abundance plotting
- Venn diagram
- Alpha diversity
- Beta diversity
- Differential abundance analysis
- Environmental data analysis
- Null model analysis
- Network analysis
- Functional analysis
R language and its packages ecosystem are wonderful tools for data analysis. In microbial community ecology field, many packages can be used for the data analysis, such as vegan[1], ape[2] and picante[3]. However, with the development of the high-throughput sequencing techniques, the increasing data amount and complexity make the data analysis a challenge. There have been some R packages created for the community data analysis in microbial ecology, such as phyloseq[4], microbiome (https://github.com/microbiome/microbiome), microbiomeSeq (http://www.github.com/umerijaz/microbiomeSeq), ampvis2 (https://madsalbertsen.github.io/ampvis2/reference/index.html) and MicrobiomeR(https://github.com/vallenderlab/MicrobiomeR). However, we lack a flexible, comprehensive and modularized R package to analyze and manage the data. We create the microeco R package[5] for this goal (https://github.com/ChiLiubio/microeco).
All the classes in microeco package depend on the R6 class. R6 uses the encapsulated object-oriented programming paradigm, which means that R6 is a profoundly different OO system from S3 and S4 because it is built on encapsulated objects, rather than generic functions. If you are interested in the class feature, read more from ‘Advanced R’ book.
A generic is a regular function, so it lives in the global namespace. An R6 method belongs to an object so it lives in a local namespace. This influences how we think about naming. The methods belong to objects, not generics, and you call them like object$method().
R6’s reference semantics allow methods to simultaneously return a value and modify an object.
Every R6 object has an S3 class that reflects its hierarchy of R6 class.
The use of help documents in the microeco package may be a little different from other packages we often used. If you wish to see one of help documents of functions, you should search the name of the class (not the name of the function) and click the link of the function.
Many tools can be used for the bioinformatic analysis, such as QIIME[6], usearch(https://www.drive5.com/usearch/), mothur[7], and RDP(http://rdp.cme.msu.edu/). Although the format of results may be different from various tools, the main files can be classified into the following parts: (1) OTU or ASV table, i.e. the species-sample abundance table; (2) taxonomy table, the taxonomy assignments information table; (3) phylogenetic tree; (4) representative sequences. (5) Generally, it is useful to create a detailed sample information table to store all the sample information, including the environmental data and the missing values (NA).
The microtable class is designed to store the basic data for all the downstream analysis in the microeco package. At least, the OTU table (i.e. species-sample abundance table) should be provided to create microtable object. We also build another R package file2meco (https://github.com/ChiLiubio/file2meco) to read the output files of some tools into microtable object for convenience. Currently, those tools/softwares include not only commonly-used QIIME [6] and QIIME2[8], but also some metagenomic tools, such as HUMAnN [9]. In this tutorial, we use the data inside the package microeco to show the operations.
The 16S rRNA sequencing results in the example data of the package is used to show the main part of the tutorial. This dataset is the 16S rRNA gene Miseq sequencing results of wetland soils in China published by An et al.[10], who surveyed soil bacterial communities in Chinese inland wetlands (IW), coastal wetland (CW) and Tibet plateau wetlands (TW) using 16S rRNA gene amplicon sequencing method. These wetlands include both saline and non-saline samples. The sample information table have 4 columns: “SampleID”, “Group”, “Type” and “Saline”. The column “SampleID” is same with the rownames. The column “Group” represents the IW, CW and TW. The column “Type” represents the sampling region: northeastern region (NE), northwest region (NW), North China area (NC), middle-lower reaches of the Yangtze River (YML), southern coastal area (SC), upper reaches of the Yangtze River (YU), Qinghai-Tibet Plateau (QTP). The column “Saline” represents the saline soils and non-saline soils. In this dataset, the environmental factor table is separated from the sample information table. Another ITS sequencing dataset is also stored in the example data of the package[11].
# load the example data; 16S rRNA gene amplicon sequencing dataset
data(sample_info_16S)
data(otu_table_16S)
data(taxonomy_table_16S)
data(phylo_tree_16S)
data(env_data_16S)
# use pipe operator in magrittr package
library(magrittr)
# set.seed is used to fix the random number generation to make the results repeatable
set.seed(123)
# make the plotting background same with the tutorial
library(ggplot2)
theme_set(theme_bw())Make sure that the data types of sample_table, otu_table and tax_table are all data.frame as the following part shows.
[1] “data.frame”
| S1 | S2 | S3 | S4 | S5 | |
|---|---|---|---|---|---|
| OTU_4272 | 1 | 0 | 1 | 1 | 0 |
| OTU_236 | 1 | 4 | 0 | 2 | 35 |
| OTU_399 | 9 | 2 | 2 | 4 | 4 |
| OTU_1556 | 5 | 18 | 7 | 3 | 2 |
| OTU_32 | 83 | 9 | 19 | 8 | 102 |
[1] “data.frame”
| Kingdom | Phylum | Class | |
|---|---|---|---|
| OTU_4272 | k__Bacteria | p__Firmicutes | c__Bacilli |
| OTU_236 | k__Bacteria | p__Chloroflexi | c__ |
| OTU_399 | k__Bacteria | p__Proteobacteria | c__Betaproteobacteria |
| OTU_1556 | k__Bacteria | p__Acidobacteria | c__Acidobacteria |
| OTU_32 | k__Archaea | p__Miscellaneous Crenarchaeotic Group | c__ |
Sometimes, your taxonomic table may have some chaotic information, such NA, unidentified and unknown. These information can influence the following taxonomic abundance calculation. So it is necessary to clean this file using the following code. Another important part of this operation is to unify the taxonomic prefix, e.g. transforming D_1__ to p__ for phylum level.
[1] “data.frame”
| SampleID | Group | Type | Saline | |
|---|---|---|---|---|
| S1 | S1 | IW | NE | Non-saline soil |
| S2 | S2 | IW | NE | Non-saline soil |
| S3 | S3 | IW | NE | Non-saline soil |
| S4 | S4 | IW | NE | Non-saline soil |
| S5 | S5 | IW | NE | Non-saline soil |
[1] “data.frame”
| Latitude | Longitude | Altitude | Temperature | Precipitation | |
|---|---|---|---|---|---|
| S1 | 52.96 | 122.6 | 432 | -4.2 | 445 |
| S2 | 52.95 | 122.6 | 445 | -4.3 | 449 |
| S3 | 52.95 | 122.6 | 430 | -4.3 | 449 |
| S4 | 52.95 | 122.6 | 430 | -4.3 | 449 |
| S5 | 52.95 | 122.6 | 429 | -4.3 | 449 |
[1] “phylo”
Then, we create an object of microtable class. This operation is very similar with the package phyloseq[4], but microeco is more brief and simpler. The otu_table in the microtable class must be the species-sample format: rownames must be OTU names, colnames must be sample names. The required sample names must be same in rownames of sample_table and colnames of otu_table.
# In R6 class, '$new' is the original method used to create a new object of class
dataset <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S, tax_table = taxonomy_table_16S, phylo_tree = phylo_tree_16S)
class(dataset)[1] “microtable” “R6”
microtable class: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13628 rows and 7 columns phylo_tree have 14096 tips
To make the species and sample information consistent across different files in the dataset object, we can use function tidy_dataset() to trim the dataset.
microtable class: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13628 rows and 7 columns phylo_tree have 13628 tips
Then, we remove OTUs which are not assigned in the Kingdom "k__Archaea" or "k__Bacteria".
dataset$tax_table %<>% base::subset(Kingdom == "k__Archaea" | Kingdom == "k__Bacteria")
print(dataset)microtable class: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13330 rows and 7 columns phylo_tree have 13628 tips
We also remove OTUs with the taxonomic assignments “mitochondria” or “chloroplast”.
# This will remove the lines containing the taxa word regardless of taxonomic ranks and ignoring word case in the tax_table.
# So if you want to filter some taxa not considerd pollutions, please use subset like the previous operation.
dataset$filter_pollution(taxa = c("mitochondria", "chloroplast"))
print(dataset)microtable class: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13296 rows and 7 columns phylo_tree have 13628 tips
Then, to make the OTUs same in otu_table, tax_table and phylo_tree, we use tidy_dataset() again.
microtable class: sample_table have 90 rows and 4 columns otu_table have 13296 rows and 90 columns tax_table have 13296 rows and 7 columns phylo_tree have 13296 tips
Then we use sample_sums() to check the sequence numbers in each sample.
[1] 10316 37087
Sometimes, in order to reduce the effects of species number on the diversity measurements, we need to perform the resampling to make the sequence number equal for each sample. The function rarefy_samples can invoke the function tidy_dataset automatically before and after the rarefying.
# As an example, we use 10000 sequences in each sample
dataset$rarefy_samples(sample.size = 10000)
dataset$sample_sums() %>% range[1] 10000 10000
Then, we calculate the taxa abundance at each taxonomic rank using cal_abund(). This function return a list called taxa_abund containing several data frame of the abundance information at each taxonomic rank. The list is stored in the microtable object automatically.
[1] “list”
If you want to save the taxa abundance file to a local place, use save_abund().
Then, we calculate the alpha diversity. The result is also stored in the object microtable automatically. As an example, we do not calculate phylogenetic diversity.
# If you want to add Faith's phylogenetic diversity, use PD = TRUE, this will be a little slow
dataset$cal_alphadiv(PD = FALSE)
# return dataset$alpha_diversity
class(dataset$alpha_diversity)[1] “data.frame”
# save dataset$alpha_diversity to a directory
dir.create("alpha_diversity")
dataset$save_alphadiv(dirpath = "alpha_diversity")We also calculate the distance matrix of beta diversity using function cal_betadiv(). We provide four most frequently used indexes: Bray-curtis, Jaccard, weighted Unifrac and unweighted unifrac.
# If you do not want to calculate unifrac metrics, use unifrac = FALSE
# require GUniFrac package
dataset$cal_betadiv(unifrac = TRUE)
# return dataset$beta_diversity
class(dataset$beta_diversity)
# save dataset$beta_diversity to a directory
dir.create("beta_diversity")
dataset$save_betadiv(dirpath = "beta_diversity")This class is used to transform taxonomic abundance data for plotting the taxa abundance with the ggplot2 package. We first use this class for the bar plot.
# create trans_abund object
# use 10 Phyla with the highest abundance in the dataset.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10)
# t1 object now include the transformed abundance data t1$abund_data and other elements for the following plottingWe remove the sample names in x axis and add the facet to show abundance according to groups.
t1$plot_bar(others_color = "grey70", facet = "Group", xtext_keep = FALSE, legend_text_italic = FALSE)
# return a ggplot2 object# The groupmean parameter can be used to obtain the group-mean barplot.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10, groupmean = "Group")
t1$plot_bar(others_color = "grey70", legend_text_italic = FALSE)Then alluvial plot is implemented in the plot_bar function.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 8)
# use_alluvium = TRUE make the alluvial plot, clustering =TRUE can be used to reorder the samples by clustering
t1$plot_bar(use_alluvium = TRUE, clustering = TRUE, xtext_type_hor = FALSE, xtext_size = 6)The box plot is an excellent way to intuitionally show data distribution across groups.
# show 15 taxa at Class level
t1 <- trans_abund$new(dataset = dataset, taxrank = "Class", ntaxa = 15)
t1$plot_box(group = "Group")Then we show the heatmap with the high abundant genera.
# show 40 taxa at Genus level
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 40)
t1$plot_heatmap(facet = "Group", xtext_keep = FALSE, withmargin = FALSE)Then, we show the pie chart.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 6, groupmean = "Group")
# all pie chart in one row
t1$plot_pie(facet_nrow = 1)The trans_venn class is used for venn analysis. To analyze the unique and shared OTUs of groups, we first merge samples according to the “Group” column of sample_table.
# merge samples as one community for each group
dataset1 <- dataset$merge_samples(use_group = "Group")
# dataset1 is a new microtable object
# create trans_venn object
t1 <- trans_venn$new(dataset1, ratio = "seqratio")
t1$plot_venn()
# The integer data is OTU number
# The percentage data is the sequence number/total sequence numberWhen the groups are too many to show with venn plot, we can use petal plot.
# use "Type" column in sample_table
dataset1 <- dataset$merge_samples(use_group = "Type")
t1 <- trans_venn$new(dataset1)
t1$plot_venn(petal_plot = TRUE)Sometimes, we are interested in the unique and shared species. In another words, the composition of the unique or shared species may account for the different and similar parts of ecological characteristics across groups[12]. For this goal, we first transform the results of venn plot to the traditional species-sample table, that is, another object of microtable class.
dataset1 <- dataset$merge_samples(use_group = "Group")
t1 <- trans_venn$new(dataset1)
# transform venn results to the sample-species table, here do not consider abundance, only use presence/absence information.
t2 <- t1$trans_venn_com(use_OTUs_frequency = TRUE)
# t2 is a new microtable class, each part is considered as a sample
class(t2)[1] “microtable” “R6”
We use bar plot to show the composition at the Genus level.
# calculate taxa abundance, that is, the frequency
t2$cal_abund()
# transform and plot
t3 <- trans_abund$new(dataset = t2, taxrank = "Genus", ntaxa = 10)
t3$plot_bar(bar_type = "part", legend_text_italic = T, ylab_title = "Frequency (%)", xtext_type_hor = FALSE)We also try to use pie chart to show the compositions at the Phylum level.
t3 <- trans_abund$new(dataset = t2, taxrank = "Phylum", ntaxa = 8)
t3$plot_pie(facet_nrow = 3, use_colors = rev(c(RColorBrewer::brewer.pal(8, "Dark2"), "grey50")))Alpha diversity can be transformed and plotted using trans_alpha class. Creating trans_alpha object can return two data frame: alpha_data and alpha_stat.
t1 <- trans_alpha$new(dataset = dataset, group = "Group")
# return t1$alpha_stat
t1$alpha_stat[1:5, ]| Group | Measure | N | Mean | SD | SE |
|---|---|---|---|---|---|
| CW | Observed | 30 | 1843 | 220.6 | 40.27 |
| CW | Chao1 | 30 | 2553 | 338.1 | 61.73 |
| CW | ACE | 30 | 2716 | 367 | 67.01 |
| CW | Shannon | 30 | 6.308 | 0.5355 | 0.09777 |
| CW | Simpson | 30 | 0.9897 | 0.01305 | 0.002382 |
Then, we test the differences among groups using the KW rank sum test and anova with multiple comparisons.
| Groups | Measure | Test method | p.value | Significance |
|---|---|---|---|---|
| IW vs CW | Observed | KW | 0.0371 | * |
| IW vs TW | Observed | KW | 0.4553 | |
| CW vs TW | Observed | KW | 0.3912 | |
| IW vs CW vs TW | Observed | KW | 0.155 | |
| IW vs CW | Chao1 | KW | 0.002689 | ** |
| Observed | Chao1 | ACE | Shannon | Simpson | InvSimpson | Fisher | Coverage | |
|---|---|---|---|---|---|---|---|---|
| IW | a | a | a | a | a | a | a | b |
| TW | a | ab | b | a | a | a | a | a |
| CW | a | b | b | a | a | a | a | a |
Now, let us plot the mean and se of alpha diversity for each group, and add the duncan.test (agricolae package) result.
We can also use the boxplot to show the paired comparisons directly.
The distance matrix of beta diversity can be transformed and plotted using trans_beta class. The analysis referred to the beta diversity in this class mainly include ordination, group distance, clustering and manova. We first show the ordination using PCoA.
# we first create an object and select PCoA for ordination
t1 <- trans_beta$new(dataset = dataset, group = "Group", measure = "bray", ordination = "PCoA")
# t1$res_ordination is the ordination result list
class(t1$res_ordination)[1] “list”
# plot the PCoA result
t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_group_ellipse = TRUE)Then we plot and compare the group distances.
# calculate and plot sample distances within groups
t1$cal_group_distance()
# return t1$res_group_distance
t1$plot_group_distance(distance_pair_stat = TRUE)# calculate and plot sample distances between groups
t1$cal_group_distance(within_group = FALSE)
t1$plot_group_distance(distance_pair_stat = TRUE)Clustering plot is also a frequently used method.
# use replace_name to set the label name, group parameter used to set the color
t1$plot_clustering(group = "Group", replace_name = c("Saline", "Type"))perMANOVA is often used in the differential test of distances among groups.
| Df | SumsOfSqs | MeanSqs | F.Model | R2 | Pr(>F) | |
|---|---|---|---|---|---|---|
| Group | 2 | 6.121 | 3.06 | 10.57 | 0.1955 | 0.001 |
| Residuals | 87 | 25.18 | 0.2895 | NA | 0.8045 | NA |
| Total | 89 | 31.3 | NA | NA | 1 | NA |
| Groups | measure | permutations | R2 | p.value | Significance |
|---|---|---|---|---|---|
| IW vs CW | bray | 999 | 0.1595 | 0.001 | *** |
| IW vs TW | bray | 999 | 0.147 | 0.001 | *** |
| CW vs TW | bray | 999 | 0.1556 | 0.001 | *** |
# manova for specified group set: here "Group + Type"
t1$cal_manova(cal_manova_set = "Group + Type")
t1$res_manova$aov.tab| Df | SumsOfSqs | MeanSqs | F.Model | R2 | Pr(>F) | |
|---|---|---|---|---|---|---|
| Group | 2 | 6.121 | 3.06 | 12.01 | 0.1955 | 0.001 |
| Type | 3 | 3.783 | 1.261 | 4.949 | 0.1208 | 0.001 |
| Residuals | 84 | 21.4 | 0.2548 | NA | 0.6836 | NA |
| Total | 89 | 31.3 | NA | NA | 1 | NA |
Differential abundance test is a very important part in the microbial community data analysis. It can be used to find the significant taxa in determining the community differences across groups. Currently, trans_diff class have three famous approaches to perform this analysis: metastat[13], LEfSe[14] and random forest. Metastat depends on the permutations and t-test and performs well on the sparse data. It is used for the comparisons between two groups.
# metastat analysis at Genus level
t1 <- trans_diff$new(dataset = dataset, method = "metastat", group = "Group", metastat_taxa_level = "Genus")
# t1$res_metastat is the result
# t1$res_metastat_group_matrix is the group comparisons order for plotting
# plot the first paired groups, choose_group = 1
t1$plot_metastat(use_number = 1:10, qvalue = 0.05, choose_group = 1)LEfSe combines the non-parametric test and linear discriminant analysis. We implement this approach in this package instead of the python version.
t1 <- trans_diff$new(dataset = dataset, method = "lefse", group = "Group", alpha = 0.01, lefse_subgroup = NULL)
# t1$res_lefse is the LEfSe result
# t1$res_abund is the abundance information
t1$plot_lefse_bar(LDA_score = 4)| Taxa | Group | pvalue | LDA |
|---|---|---|---|
| k__Bacteria|p__Proteobacteria | CW | 3.21e-11 | 4.834 |
| k__Bacteria|p__Acidobacteria|c__Acidobacteria | IW | 8.559e-13 | 4.787 |
| k__Bacteria|p__Acidobacteria | IW | 5.749e-12 | 4.785 |
| k__Bacteria|p__Bacteroidetes | TW | 1.19e-09 | 4.776 |
| k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria | CW | 5.475e-12 | 4.613 |
We can also plot the abundance of taxa detected using LEfSe.
Then, we show the cladogram of the differential features in the taxonomic tree. There are too many taxa in this dataset. As an example, we only use the highest 200 abundant taxa in the tree and 50 differential features. We only show the full taxonomic label at Phylum level and use letters at other levels to reduce the text overlap.
# clade_label_level 5 represent phylum level in this analysis
# require ggtree package
t1$plot_lefse_cladogram(use_taxa_num = 200, use_feature_num = 50, clade_label_level = 5)There may be a problem related with the taxonomic labels in the plot. When the levels used are too many, the taxonomic labels may have too much overlap. However, if we only indicate the Phylum labels, the taxa in the legend with marked letters are too many. At this time, you can select the taxa that you want to show in the plot manually like the following operation.
# choose some taxa according to the positions in the previous picture; those taxa labels have minimum overlap
use_labels <- c("c__Deltaproteobacteria", "c__Actinobacteria", "o__Rhizobiales", "p__Proteobacteria", "p__Bacteroidetes",
"o__Micrococcales", "p__Acidobacteria", "p__Verrucomicrobia", "p__Firmicutes",
"p__Chloroflexi", "c__Acidobacteria", "c__Gammaproteobacteria", "c__Betaproteobacteria", "c__KD4-96",
"c__Bacilli", "o__Gemmatimonadales", "f__Gemmatimonadaceae", "o__Bacillales", "o__Rhodobacterales")
# then use parameter select_show_labels to show
t1$plot_lefse_cladogram(use_taxa_num = 200, use_feature_num = 50, select_show_labels = use_labels)
# Now we can see that more taxa names appear in the treeIf you are interested in taxonomic tree, you can also use metacoder package[15] to plot the taxonomic tree based on the selected taxa. We do not show the usage here.
The third approach is rf, which depends on the random forest[16, 17] and the non-parametric test. The current method can calculate random forest by bootstrapping like the method in LEfSe and only use the significant features. MeanDecreaseGini is selected as the indicator value in the analysis.
# use Genus level for parameter rf_taxa_level, if you want to use all taxa, change to "all"
# nresam = 1 and boots = 1 represent no bootstrapping and use all samples directly
t1 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", rf_taxa_level = "Genus")
# t1$res_rf is the result stored in the object
# plot the result
t2 <- t1$plot_diff_abund(use_number = 1:20, only_abund_plot = FALSE)
gridExtra::grid.arrange(t2$p1, t2$p2, ncol=2, nrow = 1, widths = c(2,2))
# the middle asterisk represent the significancesThe environmental variables are very useful in analyzing microbial community structure and assembly mechanisms. We first show the RDA analysis (db-RDA and RDA).
# add_data is used to add the environmental data
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])# use bray-curtis distance to do dbrda
t1$cal_rda(use_dbrda = TRUE, use_measure = "bray")
# t1$res_rda is the result list stored in the object
t1$trans_rda(adjust_arrow_length = TRUE, max_perc_env = 10)
# t1$res_rda_trans is the transformed result for plotting
t1$plot_rda(plot_color = "Group")# use Genus
t1$cal_rda(use_dbrda = FALSE, taxa_level = "Genus")
# As the main results of RDA are related with the projection and angles between different arrows,
# we adjust the length of the arrow to show them clearly using several parameters.
t1$trans_rda(show_taxa = 10, adjust_arrow_length = TRUE, max_perc_env = 1500, max_perc_tax = 3000, min_perc_env = 200, min_perc_tax = 300)
# t1$res_rda_trans is the transformed result for plotting
t1$plot_rda(plot_color = "Group")Mantel test can be used to check whether there is significant correlations between environmental variables and distance matrix.
| variable_name | cor_method | corr_res | p_res | significance |
|---|---|---|---|---|
| Temperature | pearson | 0.452 | 0.001 | *** |
| Precipitation | pearson | 0.2791 | 0.001 | *** |
| TOC | pearson | 0.13 | 0.003 | ** |
| NH4 | pearson | -0.05539 | 0.922 | |
| NO3 | pearson | 0.06758 | 0.049 | * |
| pH | pearson | 0.4085 | 0.001 | *** |
| Conductivity | pearson | 0.2643 | 0.001 | *** |
| TN | pearson | 0.1321 | 0.002 | ** |
The correlations between environmental variables and taxa are important in analyzing and inferring the factors affecting community structure. In this example, we first perform the differential abundance test and random forest analysis to obtain the important genera. Then we use those taxa to perform correlation analysis.
# first create trans_diff object
t2 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", rf_taxa_level = "Genus")
# then create trans_env object
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
# calculate correlation
t1$cal_cor(use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_rf$Taxa[1:40])
# return t1$res_cor Then, we can plot the correlation results using ggplot2 or pheatmap.
Sometimes, it is necessary to study the correlations between environmental variables and taxa for different groups.
# calculate correlations for different groups using parameter by_group
t1$cal_cor(by_group = "Group", use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_rf$Taxa[1:40])
# return t1$res_cor
t1$plot_corr()If you are concerned with the relationship between environmental factors and alpha diversity, you can also use this function.
t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
# use add_abund_table parameter to add the extra data table
t1$cal_cor(add_abund_table = dataset$alpha_diversity)
t1$plot_corr()In recent decades, the integration of phylogenetic analysis and null model promotes the inference of niche and neutral influences on community assembly more powerfully by adding a phylogeny dimension [18, 19]. The trans_nullmodel class provides an encapsulation, including the calculation of the phylogenetic signal, beta mean pairwise phylogenetic distance (betaMPD), beta mean nearest taxon distance (betaMNTD), beta nearest taxon index (betaNTI), beta net relatedness index (betaNRI) and Bray-Curtis-based Raup-Crick (RCbray). The approach for phylogenetic signal analysis is based on the mantel correlogram [20], in which the change of phylogenetic signal is intuitional and clear compared to other approaches. The algorithms of betaMNTD and betaMPD have been optimized to be faster than those in the picante package [3]. The combinations between RCbray and betaNTI (or betaNRI) can be used to infer the strength of each ecological process dominating the community assembly under the specific hypothesis [19]. This can be achievable by the function cal_process() to parse the percentage of each inferred process. We first check the phylogenetic signal.
# generate trans_nullmodel object; use 1000 OTUs as example
t1 <- trans_nullmodel$new(dataset, taxa_number = 1000, add_data = env_data_16S)# use pH as the test variable
t1$cal_mantel_corr(use_env = "pH")
# return t1$res_mantel_corr
# plot the mantel correlogram
t1$plot_mantel_corr()betaNRI(ses.betampd) is used to show the ‘basal’ phylogenetic turnover[20]. Compared to betaNTI, it can capture more turnover information associated with the deep phylogeny. It is noted that there are many null models with the development in the several decades. In the trans_nullmodel class, we randomized the phylogenetic relatedness of species. This shuffling approach fix the observed levels of species α-diversity and β-diversity to explore whether the observed phylogenetic turnover significantly differ from null model that phylogenetic relatedness among species are random.
# null model run 500 times
t1$cal_ses_betampd(runs=500, abundance.weighted = TRUE)
# return t1$res_ses_betampdIf we want to plot the betaNRI, we can use plot_group_distance function in trans_beta class. For example, the results showed that the mean betaNRI of TW is extremely and significantly larger that those in CW and IW, revealing that the basal phylogenetic turnover in TW is high.
# add betaNRI matrix to beta_diversity list
dataset$beta_diversity[["betaNRI"]] <- t1$res_ses_betampd
# create trans_beta class, use measure "betaNRI"
t2 <- trans_beta$new(dataset = dataset, group = "Group", measure = "betaNRI")
# transform the distance for each group
t2$cal_group_distance()
# plot the results
g1 <- t2$plot_group_distance(distance_pair_stat = TRUE)
g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)Sometimes, if you want to perform null model analysis for each group individually, such as one group as one species pool, you can calculate the results for each group, respectively. We can find that, when we perform betaNRI for each group respectively, mean betaNRI between CW and TW are not significantly different, and they are both significantly higher than that in IW, revealing that the strength of variable selection in CW and TW may be similar under the condition that each area is considered as a specific species pool.
# we create a list to store the trans_nullmodel results.
sesbeta_each <- list()
group_col <- "Group"
all_groups <- unique(dataset$sample_table[, group_col])
# calculate for each group, respectively
for(i in all_groups){
# like the above operation, but need provide 'group' and 'select_group'
test <- trans_nullmodel$new(dataset, group = group_col, select_group = i, taxa_number = 1000, add_data = env_data_16S)
test$cal_ses_betampd(runs = 500, abundance.weighted = TRUE)
sesbeta_each[[i]] <- test$res_ses_betampd
}
# merge and reshape to generate one symmetrical matrix
test <- lapply(sesbeta_each, melt) %>% do.call(rbind, .) %>%
reshape2::dcast(., Var1~Var2, value.var = "value") %>% `row.names<-`(.[,1]) %>% .[, -1, drop = FALSE]
# like the above operation
dataset$beta_diversity[["betaNRI"]] <- test
t2 <- trans_beta$new(dataset = dataset, group = "Group", measure = "betaNRI")
t2$cal_group_distance()
g1 <- t2$plot_group_distance(distance_pair_stat = TRUE)
g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)BetaNTI(ses.betamntd) can be used to indicate the phylogenetic terminal turnover [19].
# null model run 500 times
t1$cal_ses_betamntd(runs=500, abundance.weighted = TRUE)
# return t1$res_ses_betamntd| S1 | S2 | S3 | S4 | S5 | |
|---|---|---|---|---|---|
| S1 | 0 | -6.554 | -6.563 | -6.308 | -6.153 |
| S2 | -6.554 | 0 | -6.678 | -6.675 | -6.124 |
| S3 | -6.563 | -6.678 | 0 | -6.544 | -6.46 |
| S4 | -6.308 | -6.675 | -6.544 | 0 | -6.356 |
| S5 | -6.153 | -6.124 | -6.46 | -6.356 | 0 |
RCbray (Bray-Curtis-based Raup-Crick) can be calculated using function cal_rcbray() to assess whether the compositional turnover was governed primarily by drift [21]. We applied null model to simulate species distribution by randomly sampling individuals from each species pool with preserving species occurrence frequency and sample species richness [20].
As an example, we also calculate the proportion of the inferred processes on the community assembly as shown in the references [19, 20]. In the example, the fraction of pairwise comparisons with significant betaNTI values (|βNTI| > 2) is the estimated influence of Selection; βNTI > 2 represents the heterogeneous selection; βNTI < -2 represents the homogeneous selection. The value of RCbray characterizes the magnitude of deviation between observed Bray–Curtis and Bray–Curtis expected under the randomization; a value of |RCbray| > 0.95 was considered significant. The fraction of all pairwise comparisons with |βNTI| < 2 and RCbray > +0.95 was taken as the influence of Dispersal Limitation combined with Drift. The fraction of all pairwise comparisons with |βNTI| < 2 and RCbray < -0.95 was taken as an estimate for the influence of Homogenizing Dispersal. The fraction of all pairwise comparisons with |βNTI| < 2 and |RCbray| < 0.95 estimates the influence of Drift acting alone.
# use betaNTI and rcbray to evaluate processes
t1$cal_process(use_betamntd = TRUE)
# return t1$res_process| process | percentage |
|---|---|
| variable selection | 3.995 |
| homogeneous selection | 48.34 |
| dispersal limitation | 0.02497 |
| homogeneous dispersal | 8.539 |
| drift | 39.1 |
Network is a frequently used approach to study the co-occurrence patterns in microbial ecology[22–24]. In this part, we describe all the core contents in the trans_network class. The network construction approaches can be classified into two types: correlation-based and non correlation-based. Several approaches can be used to calculate correlations and significances.
We first introduce the correlation-based network. The parameter cal_cor in trans_network is used for selecting the correlation calculation method.
# Use R base cor.test, slow
t1 <- trans_network$new(dataset = dataset, cal_cor = "base", taxa_level = "OTU", filter_thres = 0.0001, cor_method = "spearman")
# return t1$res_cor_p list; one table: correlation; another: p value# SparCC method, require SpiecEasi package
# SparCC is very slow, so consider filtering more species with low abundance
t1 <- trans_network$new(dataset = dataset, cal_cor = "SparCC", taxa_level = "OTU", filter_thres = 0.001, SparCC_simu_num = 100)# When the OTU number is large, use R WGCNA package to replace R base to calculate correlations
# require WGCNA package
t1 <- trans_network$new(dataset = dataset, cal_cor = "WGCNA", taxa_level = "OTU", filter_thres = 0.0001, cor_method = "spearman")The parameter COR_cut can be used to select the correlation threshold. Furthermore, COR_optimization = TRUE represent using RMT theory to find the optimized correlation threshold instead of the COR_cut[22].
# construct network; require igraph package
t1$cal_network(p_thres = 0.01, COR_optimization = TRUE)
# return t1$res_network# use arbitrary coefficient threshold to contruct network
t1$cal_network(p_thres = 0.01, COR_cut = 0.7)# save network
# open the gexf file using Gephi(https://gephi.org/)
# require rgexf package
t1$save_network(filepath = "network.gexf")We plot the network and present the node colors according to the calculated modules in Gephi.
Now, we show the node colors with the Phylum information and the edges colors with the positive and negative correlations. All the data used has been stored in the network.gexf file, including modules classifications, Phylum information and edges classifications.
| Property | Value |
|---|---|
| Vertex | 407 |
| Edge | 1989 |
| Average_degree | 9.774 |
| Average_path_length | 3.878 |
| Network_diameter | 9 |
| Clustering_coefficient | 0.4698 |
| Density | 0.02407 |
| Heterogeneity | 1.194 |
| Centralization | 0.09908 |
# classify the node; return t1$res_node_type
t1$cal_node_type()
# return t1$res_node_type
# we retain the file for the following example in trans_func part
network_node_type <- t1$res_node_type| z | module | p | taxa_roles | |
|---|---|---|---|---|
| OTU_50 | -1.305 | M2 | 0 | Peripheral nodes |
| OTU_1 | -0.04067 | M2 | 0 | Peripheral nodes |
| OTU_55 | -1.239 | M2 | 0 | Peripheral nodes |
| OTU_13824 | -0.2403 | M2 | 0 | Peripheral nodes |
| OTU_151 | -1.372 | M2 | 0.4444 | Peripheral nodes |
# plot node roles in terms of the within-module connectivity and among-module connectivity
t1$plot_taxa_roles(use_type = 1)Now, we show the eigengene analysis of modules. The eigengene of a module, i.e. the first principal component of PCA, represents the main variance of the abundance in the species of the module.
Then we perform correlation heatmap to show the relationships between eigengenes and environmental factors.
# create trans_env object like the above operation
t2 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
# calculate correlations
t2$cal_cor(add_abund_table = t1$res_eigen)
# plot the correlation heatmap
t2$plot_corr()The function cal_sum_links() is used to sum the links (edge) number from one taxa to another or in the same taxa. The function plot_sum_links() is used to show the result from the function cal_sum_links(). This is very useful to fast see how many nodes are connected between different taxa or within one taxa. In terms of “Phylum” level in the tutorial, the function cal_sum_links() sum the linkages number from one Phylum to another Phylum or the linkages in the same Phylum. So the numbers along the outside of the circular plot represent how many edges or linkages are related with the Phylum. For example, in terms of Proteobacteria, there are roughly total 900 edges associated with the OTUs in Proteobacteria, in which roughly 200 edges connect both OTUs in Proteobacteria and roughly 150 edges connect the OTUs from Proteobacteria with the OTUs from Chloroflexi.
# calculate the links between or within taxonomic ranks
t1$cal_sum_links(taxa_level = "Phylum")
# return t1$res_sum_links_pos and t1$res_sum_links_neg
# require chorddiag package
t1$plot_sum_links(plot_pos = TRUE, plot_num = 10)The subset_network() function can be used to extract a part of nodes and edges among these nodes from the network. In this function, you should provide the nodes you need using the node parameter.
# this return a sub network that contains all nodes of module M1
t1$subset_network(node = t1$res_node_type %>% .[.$module == "M1", ] %>% rownames, rm_single = TRUE)
# return a new network with igraph classThen we show the next implemented network construction approach: SPIEC-EASI (SParse InversE Covariance Estimation for Ecological Association Inference) network in SpiecEasi R package [25].
# cal_cor select NA
t1 <- trans_network$new(dataset = dataset, cal_cor = NA, taxa_level = "OTU", filter_thres = 0.0005)
# require SpiecEasi package https://github.com/zdk123/SpiecEasi
t1$cal_network(network_method = "SpiecEasi")
# see t1$res_networkWe also introduce the third network construction approach: Probabilistic Graphical Models (PGM), which is implemented in julia package FlashWeave[26]. If you want to use this method like the following code, you should first install julia language in your computer and the FlashWeave package, and add the julia in the computer path (see FlashWeave part in https://github.com/ChiLiubio/microeco).
Ecological researchers are usually interested in the the funtional profiles of microbial communities, because functional or metabolic data is powerful to explain the structure and dynamics of microbial communities and to infer the underlying mechanisms. As metagenomic sequencing is complicated and expensive, using amplicon sequencing data to predict functional profiles is a good choice. Several software are often used for this goal, such as PICRUSt[27], Tax4Fun[28] and FAPROTAX[29, 30]. These tools are great to be used for the prediction of functional profiles based on the prokaryotic communities from sequencing results. In addition, it is also important to obtain the functions for each taxa or OTU, not just the whole profile of communities. But it is hard to know exact functions of each OTU. FAPROTAX database is a collection of the traits and characteristics of prokaryotes based on the known research results published in books and literatures. We match the taxonomic information of prokaryotes against this database to identify the traits of prokaryotes on biogeochemical roles. We also implement the FUNGuild [31] and FungalTraits [32] databases to identify the fungal traits.
# Identify microbial traits
# create object of trans_func
t2 <- trans_func$new(dataset)
# mapping the taxonomy to the database
# the function can recognize prokaryotes or fungi automatically.
t2$cal_spe_func()
# return t2$res_spe_func, 1 represent function exists, 0 represent no or cannot confirmed.| methanotrophy | acetoclastic_methanogenesis | |
|---|---|---|
| OTU_4272 | 0 | 0 |
| OTU_236 | 0 | 0 |
| OTU_399 | 0 | 0 |
| OTU_1556 | 0 | 0 |
| OTU_32 | 0 | 0 |
The percentages of the OTUs having the same trait can reflect the functional redundancy of this function in the community or the module in the network.
# calculate the percentages of OTUs for each trait in each module of network
# use_community = FALSE represent calculating module, not community, node_type_table provide the module information
t2$cal_spe_func_perc(use_community = FALSE, node_type_table = network_node_type)
# return t2$res_spe_func_perc
# we only plot some important traits, so we use the default group list to filter and show the traits.
t2$plot_spe_func_perc(select_samples = paste0("M", 1:10))
# M represents module, ordered by the nodes number from high to low# If you want to change the group list, reset the list t2$func_group_list
t2$func_group_list
# use show_prok_func to see the detailed information of prokaryotic traits
t2$show_prok_func("methanotrophy")# calculate the percentages for communities
t2$cal_spe_func_perc(use_community = TRUE)
# t2$res_spe_func_perc[1:5, 1:2]| methanotrophy | acetoclastic_methanogenesis | |
|---|---|---|
| S1 | 0.39 | 0.04 |
| S2 | 0.27 | 0 |
| S3 | 0.48 | 0 |
| S4 | 0.48 | 0 |
| S5 | 0.56 | 0 |
# then we try to correlate the res_spe_func_perc of communities to environmental variables
t3 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
t3$cal_cor(add_abund_table = t2$res_spe_func_perc, cor_method = "spearman")
t3$plot_corr(pheatmap = TRUE)Tax4Fun requires a strict input file demand on the taxonomic information. To analyze the trimmed or changed OTU data in R with Tax4Fun, we provide a link to the Tax4Fun functional prediction.
t1 <- trans_func$new(dataset)
# install Tax4Fun package and download SILVA123 ref data from http://tax4fun.gobics.de/
# decompress SILVA123; provide path in folderReferenceData as you put
t1$cal_tax4fun(folderReferenceData = "./SILVA123")
# return two files: t1$tax4fun_KO: KO file; t1$tax4fun_path: pathway file.
# t1$tax4fun_KO$Tax4FunProfile[1:5, 1:2]| K00001; alcohol dehydrogenase [EC:1.1.1.1] | K00002; alcohol dehydrogenase (NADP+) [EC:1.1.1.2] | |
|---|---|---|
| S1 | 0.0004823 | 5.942e-06 |
| S2 | 0.0005266 | 4.017e-06 |
| S3 | 0.0005054 | 6.168e-06 |
| S4 | 0.0005109 | 5.888e-06 |
| S5 | 0.0005083 | 5.547e-06 |
Now, we use pathway file to analyze the abundance of pathway.
# must transpose to taxa row, sample column
pathway_file <- t1$tax4fun_path$Tax4FunProfile %>% t %>% as.data.frame
# filter rownames, only keep ko+number
rownames(pathway_file) %<>% gsub("(^.*);\\s.*", "\\1", .)
# load the pathway hierarchical metadata
data(ko_map)
# create a microtable object, familiar?
func1 <- microtable$new(otu_table = pathway_file, tax_table = ko_map, sample_table = t1$sample_table)
print(func1)microtable class: sample_table have 90 rows and 4 columns otu_table have 284 rows and 90 columns tax_table have 341 rows and 3 columns
Now, we need to trim data and calculate abundance.
func1$tidy_dataset()
# calculate abundance automatically at three levels: level_1, level_2, level_3
func1$cal_abund()
print(func1)microtable class: sample_table have 90 rows and 4 columns otu_table have 284 rows and 90 columns tax_table have 284 rows and 3 columns Taxa abundance: calculated for level_1,level_2,level_3
Then, we can plot the abundance.
# bar plot at level_1
func2 <- trans_abund$new(func1, taxrank = "level_1", groupmean = "Group")
func2$plot_bar(legend_text_italic = FALSE)We can also do something else. For example, we can use lefse to test the differences of the abundances and find the important enriched pathways across groups.
func2 <- trans_diff$new(dataset = func1, method = "lefse", group = "Group", alpha = 0.05, lefse_subgroup = NULL)
func2$plot_lefse_bar(LDA_score = 3, width = 0.8)Tax4Fun2 [33] is another R package for the prediction of functional profiles of prokaryotic communities from 16S rRNA gene sequences. It can provides two indexes for the evaluation of functional gene redundancies. We provide two functions cal_tax4fun2() and cal_tax4fun2_FRI() in trans_func class for the Tax4Fun2 analysis.
Now, we use the ITS amplicon sequencing dataset as an example to show the use of FUNGuild database[31]. FungalTraits [32] database is also available for identifying fungal traits.
# show the ITS dataset preprocessing, the functional identification of OTUs and functional redundancy of modules
data(sample_info_ITS)
data(otu_table_ITS)
data(taxonomy_table_ITS)
# create microtable object
dataset <- microtable$new(sample_table = sample_info_ITS, otu_table = otu_table_ITS, tax_table = taxonomy_table_ITS)
# remove the taxa not assigned in the Kingdom "k__Fungi"
dataset$tax_table %<>% base::subset(Kingdom == "k__Fungi")
# use tidy_dataset() to make OTUs and samples information consistent across files
dataset$tidy_dataset()
# create trans_network object
t1 <- trans_network$new(dataset = dataset, cal_cor = "WGCNA", taxa_level = "OTU", filter_thres = 0.000001, cor_method = "spearman")
# create correlation network
t1$cal_network(p_thres = 0.05, COR_cut = 0.6)
# calculate node topological properties
t1$cal_node_type()
node_type_table <- t1$res_node_type
# create trans_func object
t2 <- trans_func$new(dataset)
# identify species traits, automatically select database for prokaryotes or fungi
# fungi_database = "FungalTraits" for the FungalTraits database
t2$cal_spe_func(fungi_database = "FUNGuild")
# calculate abundance-unweighted functional redundancy of each trait for each network module
t2$cal_spe_func_perc(use_community = FALSE, node_type_table = node_type_table)
# plot the functional redundancy of network modules
t2$plot_spe_func_perc(select_samples = paste0("M", 1:10))Most of the plots are generated by applying the ggplot2 package. The important parameters in the plotting functions are configured according to our experience. If the inner parameters can not meet the use, the user can add the layers to the plot like the following operation or make the plot using the data (generally data.frame class) stored in the object.
# The groupmean parameter can be used to obtain the group-mean barplot.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10, groupmean = "Group")
g1 <- t1$plot_bar(others_color = "grey70", legend_text_italic = FALSE)
g1 + theme_classic() + theme(axis.title.y = element_text(size = 18))R6 class has a special copy mechanism which is different from S3 and S4. If you want to copy an object completely, you should use function clone instead of direct assignment.
# use clone to copy completely
t1 <- clone(dataset)
t2 <- clone(t1)
t2$sample_table <- NULL
identical(t2, t1)[1] FALSE
# this operation is usually unuseful, because changing t2 will also affect t1
t2 <- t1
t2$sample_table <- NULL
identical(t2, t1)[1] TRUE
We donnot provide the special function to filter samples in microtable class, as we think it is redundant. We recommend user to directly manipulate the sample_table in microtable object. For example, if we want to analyze samples from ‘CW’ and ‘IW’, respectively, we can operate like this:
# remember first clone the full dataset
group1 <- clone(dataset)
group1$sample_table <- subset(group1$sample_table, Group == "CW")
# this is necessary to make files in group1 corresponding
group1$tidy_dataset()
# similar with obove operation
group2 <- clone(dataset)
group2$sample_table <- subset(group2$sample_table, Group == "IW")
group2$tidy_dataset()
# now we get two microtable objects: group1 for CW and group2 for IWAll the classes are set public, meaning that you can change, add or remove the objects stored in them as you want.
To meet the analysis of complicated cases with mixed taxonomic levels, the cal_abund() function in microtable class supports the customized analysis on the selected levels or columns in tax_table and also the selection of relative/absolute abundance.
1. Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, et al. Vegan: Community ecology package. 2019.
2. Paradis E, Schliep K. Ape 5.0: An environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 2018; 35: 526–528.
3. Kembel SW, Cowan PD, Helmus MR, Cornwell WK, Morlon H, Ackerly DD, et al. Picante: R tools for integrating phylogenies and ecology. Bioinformatics 2010; 26: 1463–1464.
4. Mcmurdie PJ, Holmes S. Phyloseq: An r package for reproducible interactive analysis and graphics of microbiome census data. Plos One 2013; 8: e61217.
5. Liu C, Cui Y, Li X, Yao M. microeco: An R package for data mining in microbial community ecology. FEMS Microbiol Ecol 2021; 97: fiaa255.
6. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 2010; 7: 335–336.
7. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology 2009; 75: 7537–41.
8. Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, et al. Reproducible, interactive, scalable and extensible microbiome data science using qiime 2. Nat Biotechnol 2019; 37: 852–857.
9. Franzosa EA, McIver LJ, Rahnavard G, Thompson LR, Schirmer M, Weingart G, et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nat Methods 2018; 15: 962–968.
10. An J, Liu C, Wang Q, Yao M, Rui J, Zhang S, et al. Soil bacterial community structure in chinese wetlands. Geoderma 2019; 337: 290–299.
11. Gao C, Montoya L, Xu L, Madera M, Hollingsworth J, Purdom E, et al. Strong succession in arbuscular mycorrhizal fungal communities. The ISME journal 2019; 13: 214.
12. Mendes R, Kruijt M, De BI, Dekkers E, Van der VM, Schneider JH, et al. Deciphering the rhizosphere microbiome for disease-suppressive bacteria. Science 2011; 332: 1097–100.
13. White J, Nagarajan N, Pop M. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Computational Biology 2009; 5: e1000352.
14. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, et al. Metagenomic biomarker discovery and explanation. Genome Biology 2011; 12: R60.
15. Foster Z, Sharpton TJ, Grunwald NJ. Metacoder: An r package for visualization and manipulation of community taxonomic diversity data. PLOS Computational Biology 2017; 13: e1005404.
16. Beck D, Foster JA. Machine learning techniques accurately classify microbial communities by bacterial vaginosis characteristics. PloS one 2014; 9: e87830.
17. Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, et al. Human gut microbiome viewed across age and geography. Nature 2012; 486: 222–227.
18. Webb CO, Ackerly DD, McPeek MA, Donoghue MJ. Phylogenies and community ecology. Annu Rev Ecol Syst 2002; 33: 475–505.
19. Stegen JC, Lin X, Fredrickson JK, Chen X, Kennedy DW, Murray CJ, et al. Quantifying community assembly processes and identifying features that impose them. The ISME Journal 2013; 7: 2069–2079.
20. Liu C, Yao M, Stegen JC, Rui J, Li J, Li X. Long-term nitrogen addition affects the phylogenetic turnover of soil microbial community responding to moisture pulse. Scientific Reports 2017; 7: 17492.
21. Chase JM, Kraft NJ, Smith KG, Vellend M, Inouye BD. Using null models to disentangle variation in community dissimilarity from variation in α-diversity. Ecosphere 2011; 2: art24.
22. Deng Y, Jiang Y-H, Yang Y, He Z, Luo F, Zhou J. Molecular ecological network analyses. BMC bioinformatics 2012; 13: 113.
23. Faust K, Raes J. Microbial interactions: From networks to models. Nature Reviews Microbiology 2012; 10: 538–550.
24. Coyte KZ, Schluter J, Foster KR. The ecology of the microbiome: Networks, competition, and stability. Science 2015; 350: 663–666.
25. Kurtz ZD, Muller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 2015; 11: e1004226.
26. Tackmann J, Matias Rodrigues JF, Mering C von. Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data. Cell Systems 2019; 9: 286–296 e8.
27. Langille MG, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnology 2013; 31: 814–21.
28. Aßhauer KP, Wemheuer B, Daniel R, Meinicke P. Tax4Fun: Predicting functional profiles from metagenomic 16S rRNA data. Bioinformatics 2015; 31: 2882–2884.
29. Louca S, Jacques SM, Pires AP, Leal JS, Srivastava DS, Parfrey LW, et al. High taxonomic variability despite stable functional structure across microbial communities. Nature Ecology & Evolution 2016; 1: 0015.
30. Louca S, Parfrey LW, Doebeli M. Decoupling function and taxonomy in the global ocean microbiome. Science 2016; 353: 1272.
31. Nguyen NH, Song Z, Bates ST, Branco S, Tedersoo L, Menke J, et al. FUNGuild: An open annotation tool for parsing fungal community datasets by ecological guild. Fungal Ecology 2016; 20: 241–248.
32. Põlme S, Abarenkov K, Henrik Nilsson R, Lindahl BD, Clemmensen KE, Kauserud H, et al. FungalTraits: A user-friendly traits database of fungi and fungus-like stramenopiles. Fungal Diversity 2020; 105: 1–16.
33. Wemheuer F, Taylor JA, Daniel R, Johnston E, Meinicke P, Thomas T, et al. Tax4Fun2: Prediction of habitat-specific functional profiles and functional redundancy based on 16S rRNA gene sequences. Environmental Microbiome 2020; 15.